{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "daff0844-6030-4c75-9847-0c86424db9f7",
   "metadata": {},
   "source": [
    "# DocArray Hnsw Vector Store\n",
    "\n",
    "[DocArrayHnswVectorStore](https://docs.docarray.org/user_guide/storing/index_hnswlib/) is a lightweight Document Index implementation provided by [DocArray](https://github.com/docarray/docarray) that runs fully locally and is best suited for small- to medium-sized datasets. It stores vectors on disk in hnswlib, and stores all other data in SQLite.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "a5b03067-883b-497e-b600-f894467ef8c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "import logging\n",
    "import textwrap\n",
    "\n",
    "import warnings\n",
    "\n",
    "warnings.filterwarnings(\"ignore\")\n",
    "\n",
    "# stop h|uggingface warnings\n",
    "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n",
    "\n",
    "# Uncomment to see debug logs\n",
    "# logging.basicConfig(stream=sys.stdout, level=logging.INFO)\n",
    "# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))\n",
    "\n",
    "from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader, Document\n",
    "from llama_index.vector_stores import DocArrayHnswVectorStore\n",
    "from IPython.display import Markdown, display"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "779f2eaa-c097-47e5-90cb-b40ce278922f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"<your openai key>\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "d86c31cd-21a6-4f1d-95ff-04b6e67d4901",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Document ID: 07d9ca27-ded0-46fa-9165-7e621216fd47 Document Hash: 77ae91ab542f3abb308c4d7c77c9bc4c9ad0ccd63144802b7cbe7e1bb3a4094e\n"
     ]
    }
   ],
   "source": [
    "# load documents\n",
    "documents = SimpleDirectoryReader(\"../data/paul_graham\").load_data()\n",
    "print(\"Document ID:\", documents[0].doc_id, \"Document Hash:\", documents[0].doc_hash)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "6543688b-6e0c-4764-b94e-1e3a2c660392",
   "metadata": {},
   "source": [
    "## Initialization and indexing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "a3e4ed7c-7409-41dc-8a60-e079df28a717",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.storage.storage_context import StorageContext\n",
    "\n",
    "\n",
    "vector_store = DocArrayHnswVectorStore(work_dir=\"hnsw_index\")\n",
    "storage_context = StorageContext.from_defaults(vector_store=vector_store)\n",
    "index = GPTVectorStoreIndex.from_documents(documents, storage_context=storage_context)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "a038e416-0f43-4fd8-97e4-a55bc1e40a80",
   "metadata": {},
   "source": [
    "## Querying"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "294d10ec-8f49-4bef-8d08-a3e707178199",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Token indices sequence length is longer than the specified maximum sequence length for this model (1830 > 1024). Running this sequence through the model will result in indexing errors\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " Growing up, the author wrote short stories, programmed on an IBM 1401, and nagged his father to buy\n",
      "him a TRS-80 microcomputer. He wrote simple games, a program to predict how high his model rockets\n",
      "would fly, and a word processor. He also studied philosophy in college, but switched to AI after\n",
      "becoming bored with it. He then took art classes at Harvard and applied to art schools, eventually\n",
      "attending RISD.\n"
     ]
    }
   ],
   "source": [
    "# set Logging to DEBUG for more detailed outputs\n",
    "query_engine = index.as_query_engine()\n",
    "response = query_engine.query(\"What did the author do growing up?\")\n",
    "print(textwrap.fill(str(response), 100))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "5d78975c-f1ab-4243-a172-0353b768a666",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " A hard moment for the author was when he realized that the AI programs of the time were a hoax and\n",
      "that there was an unbridgeable gap between what they could do and actually understanding natural\n",
      "language.\n"
     ]
    }
   ],
   "source": [
    "response = query_engine.query(\"What was a hard moment for the author?\")\n",
    "print(textwrap.fill(str(response), 100))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "ba73872a-585c-46af-b91a-8043ba9a4c89",
   "metadata": {},
   "source": [
    "## Querying with filters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "199493b5-b3a2-4f4f-bbcd-ca0495238c24",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.schema import TextNode\n",
    "\n",
    "nodes = [\n",
    "    TextNode(\n",
    "        text=\"The Shawshank Redemption\",\n",
    "        metadata={\n",
    "            \"author\": \"Stephen King\",\n",
    "            \"theme\": \"Friendship\",\n",
    "        },\n",
    "    ),\n",
    "    TextNode(\n",
    "        text=\"The Godfather\",\n",
    "        metadata={\n",
    "            \"director\": \"Francis Ford Coppola\",\n",
    "            \"theme\": \"Mafia\",\n",
    "        },\n",
    "    ),\n",
    "    TextNode(\n",
    "        text=\"Inception\",\n",
    "        metadata={\n",
    "            \"director\": \"Christopher Nolan\",\n",
    "        },\n",
    "    ),\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "7e486454-8d78-4cf8-a92a-901e192fc767",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.storage.storage_context import StorageContext\n",
    "\n",
    "\n",
    "vector_store = DocArrayHnswVectorStore(work_dir=\"hnsw_filters\")\n",
    "storage_context = StorageContext.from_defaults(vector_store=vector_store)\n",
    "\n",
    "index = GPTVectorStoreIndex(nodes, storage_context=storage_context)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "2e039f54-f7af-4270-8952-7f6a3a18c719",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[NodeWithScore(node=Node(text='director: Francis Ford Coppola\\ntheme: Mafia\\n\\nThe Godfather', doc_id='d96456bf-ef6e-4c1b-bdb8-e90a37d881f3', embedding=None, doc_hash='b770e43e6a94854a22dc01421d3d9ef6a94931c2b8dbbadf4fdb6eb6fbe41010', extra_info=None, node_info=None, relationships={<DocumentRelationship.SOURCE: '1'>: 'None'}), score=0.4634347)]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters\n",
    "\n",
    "\n",
    "filters = MetadataFilters(filters=[ExactMatchFilter(key=\"theme\", value=\"Mafia\")])\n",
    "\n",
    "retriever = index.as_retriever(filters=filters)\n",
    "retriever.retrieve(\"What is inception about?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "15b9689b-5251-47df-97f5-9802dfc93f00",
   "metadata": {},
   "outputs": [],
   "source": [
    "# remove created indices\n",
    "import os, shutil\n",
    "\n",
    "hnsw_dirs = [\"hnsw_filters\", \"hnsw_index\"]\n",
    "for dir in hnsw_dirs:\n",
    "    if os.path.exists(dir):\n",
    "        shutil.rmtree(dir)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1d5288c9-8292-44b2-bf38-a4a439cf58ff",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
